9 research outputs found

UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD

Author: Nidadavolu Satya Venkata Phani Sankar
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 16/09/2021

The performance of automatic speaker verification (ASV) systems is very sensitive to mismatch between the training (source) and testing (target) domains. The best way to address domain mismatch is matched-condition training: gather sufficient labeled samples from the target domain and use them in training. In many cases, however, this is too expensive or impractical. Gaining access to unlabeled target-domain data, e.g., from open-source online media, together with labeled data from other domains is usually more feasible. This work focuses on making ASV systems robust to uncontrolled (‘wild’) conditions with the help of some unlabeled data acquired from such conditions. Given acoustic features from both domains, we propose learning a mapping function between the features of the two domains, implemented as a deep convolutional neural network (CNN) with an encoder-decoder architecture. We explore training the network in two scenarios: training on paired speech samples from both domains and training on unpaired data. In the former case, where the paired data is usually obtained via simulation, the CNN is treated as a nonlinear regression function and is trained to minimize the L2 loss between the original and predicted features from the target domain. We provide empirical evidence that this approach introduces distortions that affect verification performance. To address this, we explore training the CNN with an adversarial loss (along with L2), which makes the predicted features indistinguishable from the original ones and thus improves verification performance. The above framework using simulated paired data, though effective, cannot be used to train the network on unpaired data obtained by independently sampling speech from both domains. In this case, we first train a CNN using an adversarial loss to map features from target to source. We then map the predicted features back to the target domain using an auxiliary network and minimize a cycle-consistency loss between the original and reconstructed target features. Our unsupervised adaptation approach complements its supervised counterpart, where adaptation is done using labeled data from both domains. We focus on three domain mismatch scenarios: (1) sampling frequency mismatch between the domains, (2) channel mismatch, and (3) robustness to far-field and noisy speech acquired from wild conditions.
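The three training objectives named in the abstract (L2, adversarial, and cycle-consistency) can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the log-based form of the adversarial term, the L1 distance in the cycle term, the 0.1 weight, and the scaling functions standing in for the two CNNs are all assumptions made for illustration.

```python
import math

def l2_loss(predicted, original):
    """Mean squared error between predicted and original feature vectors."""
    return sum((p - o) ** 2 for p, o in zip(predicted, original)) / len(original)

def adversarial_loss(disc_probs):
    """Generator-side loss -mean(log D(.)). It is small when the
    discriminator rates the predicted features as real (probabilities near 1).
    The exact adversarial formulation is an assumption."""
    return -sum(math.log(p) for p in disc_probs) / len(disc_probs)

def cycle_consistency_loss(original, reconstructed):
    """Mean absolute error between target features and their reconstruction.
    L1 is an assumption; the abstract does not name the distance."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)

# Hypothetical stand-ins for the two networks: simple invertible scalings.
def target_to_source(feats):
    return [2.0 * f for f in feats]

def source_to_target(feats):
    return [0.5 * f for f in feats]

# Paired scenario: L2 plus a weighted adversarial term (0.1 is a made-up weight).
paired_loss = (l2_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
               + 0.1 * adversarial_loss([0.9, 0.8]))

# Unpaired scenario: map target -> source -> target and compare with the input.
target_feats = [0.5, -1.0, 2.0, 0.25]
reconstructed = source_to_target(target_to_source(target_feats))
cycle = cycle_consistency_loss(target_feats, reconstructed)
```

Because the two stand-in mappings are exact inverses, the cycle term is zero here. With real networks it would be positive and would be minimized jointly with the adversarial term.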

JScholarship

Speaker detection in the wild: Lessons learned from JSALT 2019

Author: Abdoli Sajjad
Ben-Yair Bar
Bouaziz Wassim
Bredin Hervé
Bullock Latane
Castan Diego
Chen Sizhu
Cristia Alejandrina
Dehak Najim
Du Jun
Dupoux Emmanuel
Galmant Léo
García Paola
Gill Marie-Philippe
Guo Ling
Kataria Saurabh
Lavechin Marvin
Lee Kong Aik
Nidadavolu Phani Sankar
Okabe Koji
Sun Lei
Titeux Hadrien
Villalba Jesus
Wang Xin
Publication venue: HAL CCSD
Publication date: 02/12/2019

Submitted to ICASSP 2020. This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to wild speech. We describe the research threads we explored and a set of modules that were successful for these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that effective diarization improves detection and that omitting the diarization stage degrades performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a preceding stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of the earlier stages on the final speaker detection. In this paper, we show partial results for speaker diarization to give a better understanding of the problem, and we present the final results for speaker detection.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

Speaker detection in the wild: Lessons learned from JSALT 2019

Author: Abdoli Sajjad
Ben-Yair Bar
Bouaziz Wassim
Bredin Hervé
Bullock Latane
Castan Diego
Chen Sizhu
Cristia Alejandrina
Dehak Najim
Du Jun
Dupoux Emmanuel
Galmant Léo
García Paola
Gill Marie-Philippe
Guo Ling
Kataria Saurabh
Lavechin Marvin
Lee Kong Aik
Nidadavolu Phani Sankar
Okabe Koji
Sun Lei
Titeux Hadrien
Villalba Jesus
Wang Xin
Publication venue: HAL CCSD
Publication date: 01/11/2020

International audience. This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to wild speech. We describe the research threads we explored and a set of modules that were successful for these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that effective diarization improves detection and that omitting the diarization stage degrades performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a preceding stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of the earlier stages on the final speaker detection. In this paper, we show partial results for speaker diarization to give a better understanding of the problem, and we present the final results for speaker detection.

INRIA a CCSD electronic archive server